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Epigenetic information is available from contemporary organisms, but is difficult to track back in evolutionary time. 
Here, we show that genome-wide epigenetic information can be gathered directly from next-generation sequence reads of 
DNA isolated from ancient remains. Using the genome sequence data generated from hair shafts of a 4000-yr-old Paleo- 
Eskimo belonging to the Saqqaq culture, we generate the first ancient nucleosome map coupled with a genome-wide 
survey of cytosine methylation levels. The validity of both nucleosome map and methylation levels were confirmed by the 
recovery of the expected signals at promoter regions, exon/intron boundaries, and CTCF sites. The top-scoring nucle- 
osome calls revealed distinct DNA positioning biases, attesting to nucleotide-level accuracy. The ancient methylation 
levels exhibited high conservation over time, clustering closely with modern hair tissues. Using ancient methylation 
information, we estimated the age at death of the Saqqaq individual and illustrate how epigenetic information can be used 
to infer ancient gene expression. Similar epigenetic signatures were found in other fossil material, such as 110,000- to 
130,000-yr-old bones, supporting the contention that ancient epigenomic information can be reconstructed from a deep 
past. Our findings lay the foundation for extracting epigenomic information from ancient samples, allowing shifts in 
epialleles to be tracked through evolutionary time, as well as providing an original window into modern epigenomics. 



[Supplemental material is available for this article.] 

Ancient DNA research started in the mid-1980s with the successful 
cloning and sequencing of a short mitochondrial DNA fragment 
from the quagga zebra, a species that became extinct in the early 
20th century (Higuchi et al. 1984). Soon after, the invention of PCR 
unlocked access to this fragmented and degraded DNA material 
(Paabo 1989), making it possible to amplify short gene markers of 
interest and compare their sequence to that from extant organisms. 
This illuminated a range of topics ranging from the reconstruction 
of the evolutionary origins of several now-extinct iconic mammals 
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(Orlando et al. 2003; Krause et al. 2006), the evaluation of the 
possible role played by major past climatic changes in driving mega- 
fauna extinctions (Shapiro et al. 2004; Campos et al. 2010; Lorenzen 
et al. 2011), to the identification of the pathogens responsible for 
massive historical outbreaks (Taubenberger et al. 1997). 

However, before the advent of next-generation sequencing 
(NGS) platforms, the amount of ancient sequence information one 
had access to was limited to several tens of thousands of nucleo- 
tides at best (Noonan et al. 2005, 2006), and until very recently, 
sequencing whole ancient mitochondrial genomes was considered 
a major achievement (Cooper et al. 2001; Krause et al. 2006). 
Parallel sequencing of millions to billions of short DNA fragments 
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has revolutionized ancient DNA research, and today a series of 
ancient genomes has been reconstructed from humans (Rasmussen 
et al. 2010, 2011; Keller et al. 2012; Raghavan et al. 2013), archaic 
hominins (Green et al. 2010; Reich et al. 2010; Meyer et al. 2012), 
the woolly mammoth (Miller et al. 2008), and several microbial 
pathogens (Bos et al. 2011; Martin et al. 2013; Schuenemann et al. 
2013; Yoshida et al. 2013). Those mainly date back to recent his- 
torical periods or the Late Pleistocene, but most recently, the 
characterization of a 560,000- to 780,000-yr-old horse draft ge- 
nome revealed that genomic information could be retrieved over 
much longer evolutionary time scales, probably up until the last 
million years (Orlando et al. 2013). 

Ancient genomes have provided important new insights into 
human evolution and dispersals (Rasmussen et al. 2010, 201 1; Keller 
et al. 2012; Raghavan et al. 2013), revealing an admixture between 
contemporary human ancestors and archaic hominins (Green et al. 
2010; Reich et al. 2010; Meyer et al. 2012) and multiple early human 
expansions into both Asia and North America (Rasmussen et al. 
2010, 201 1). The information gained from these samples has largely 
been limited to nucleotide polymorphisms. Unlike mutations, epi- 
genetic modifications do not alter the underlying DNA sequence, 
but can be inherited across cell divisions and from parents to off- 
spring and can control gene expression by reshaping cytosine 
methylation landscapes, nucleosome organization, and histone 
modification patterns. The range of biological processes that de- 
pend on some level of epigenetic regulation is diverse and includes 
imprinting (Bird 2002), transposition (Hollister and Gaut 2009), cell 
differentiation (Meissner et al. 2008), and cancer (Teschendorff et al. 
2011). In this study, we use the Saqqaq genome that was retrieved 
from an ~4000-yr-old tuft of hair of a Paleo-Eskimo from Greenland 
and sequenced to an average depth of 20x (Rasmussen et al. 2010). 
We demonstrate that NGS data can be used in the absence of bi- 
sulfite or further experimental treatment to directly infer genome- 
wide nucleosome organization and regional methylation levels, 
thereby providing the first survey of an ancient epigenome. 

Results 

Nucleosome occupancy signal 

A striking variation in read depth is apparent in the sequence data 
that underlies the Saqqaq genome. This variation correlates with 
functional regions, ranging from below genomic average (GA) in 
intergenic regions (0.9 x GA) to far above the average in coding re- 
gions (2.8 x GA) and 5 ' UTRs (4.0 x GA) (Fig. 1 A; Supplemental Table 
S2.1). Strikingly, CpG islands (CGIs) stand out genome-wide as 
highly defined regions with extreme read depth (6.5 x GA) (Fig. 1A). 
Read depth also varies dramatically on a local scale, showing a strong 
tendency to peak in regions of defined width (—200 bp) and at reg- 
ular intervals (Fig. 1B,C). We hypothesized that instead of resulting 
from alignment or sequencing artifacts, these patterns could stem 
from protection of DNA by nucleosome binding, with preferential 
degradation of linker regions between nucleosomes (Fig. ID) either 
by DNases that enter the nucleus during cell death or by post-mortem 
strand breaks (Nagata et al. 1998). Such cleavage patterns are one of 
the hallmarks of apoptosis, which happens during the final stages of 
hair formation (Botchkavera et al. 2006), and are exploited in stan- 
dard micrococcal nuclease (MNase) assays for mapping nucleosome 
location (Schones et al. 2008; Gaffney et al. 2012). In this scenario, 
the observed read depth would reflect nucleosome occupancy. 

To rule out mapping biases, potentially exacerbated by the 
short read lengths of ancient DNA (aDNA), we simulated a control 



data set ('Control') with the same number of reads and the same 
read-length distribution as the Saqqaq genome data set ('Saqqaq'). 
Control reads were randomly sampled and truncated to match 
Saqqaq read lengths from a panel of sequencing runs from modern 
human genomes based on lymphoblastoid cell lines of the Human 
Genome Diversity Project (HGDP) (Green et al. 2010; Reich et al. 
2010; Supplemental Material SI2.2). This control displayed less 
variation in read depth (Fig. 1A-C), with fewer sites showing ex- 
treme values and with an overall distribution very different from 
that of Saqqaq (Fig. IE). When restricting the comparison to unique 
regions of the genome, unaffected by short-read mappability issues 
(Supplemental Material SI2.2), the difference becomes even more 
pronounced, with Saqqaq showing much greater read-depth varia- 
tion than Control (Saqqaq SD = 32.1; Control SD = 9.1). 

The sequencing reaction and fragment length-dependent GC 
biases introduced while amplifying NGS libraries (Dabney and Meyer 
2012) could also potentially be responsible for the observed varia- 
tions in read depth. We corrected for this second source of bias by 
making use of pre-existing methods that are proficient in accounting 
for base compositional and mappability biases (Benjamini and Speed 
2012). As anticipated, the GC-corrected read depth correlates 
strongly with the original read depth (unique regions: Pearson cor- 
relation coefficient, PCC = 0.47; P < 1 x 10~ 16 ) (see Supplemental 
Material S2.3 for conservative P- value estimation) and is uncorrelated 
with GC content (PCC = 0.003; P < 0.75). Even though the level of 
read-depth variation decreased slightly for both Saqqaq and Control 
after GC-correction (unique regions: Saqqaq SD = 14.4; Control SD = 
5.4), both the regional and the local-scale variation remained 
(Fig. 1 A,C). This suggests that sequencing bias could explain part, but 
not all, of the original read-depth variation, in agreement with our 
hypothesis of nucleosome protection. All following analyses are 
based on the GC-corrected read depth unless otherwise noted. 

As neither mapping nor sequencing biases could account for 
the observed patterns, we proceeded to compare read-depth vari- 
ations to existing nucleosome occupancy maps (Supplemental 
Material SI2.3). We first evaluated the correlation coefficients be- 
tween data sets across the unique regions of a 20-kb subsection of a 
known nucleosome array region (Fig. IB; Supplemental Fig. S2.1a), 
where nucleosomes are thought to be consistently and specifically 
positioned independently of tissue type (Gaffney et al. 2012). 
Saqqaq correlates positively with both computational predictions 
(PCC = 0.47 and PCC = 0.43; P < 2 x 10~ 4 ) (Dennis et al. 2007; 
Ozsolak et al. 2007) and experimental MNase-based maps (PCC = 
0.23 and PCC = 0.43; P < 3 x 10" 3 ) (Schones et al. 2008, Gaffney 
et al. 2012). The two MNase-based maps show comparable levels of 
correlation between them (PCC = 0.38; P< 1 x 10~ 4 ), despite being 
based on related cell types (CD4+ and lymphoblastoid cells), and 
slightly lower correlations against the computational maps (PCC 
from 0.16 to 0.28; P < 7 x 10~ 2 ). These relatively low levels of 
correlation for the same experimental technique across well- 
positioned nucleosomes could suggest some level of noise in the 
state-of-the-art MNase-based occupancy maps, potentially from 
the cutting biases of the MNase enzyme (Gaffney et al. 2012). The 
uncorrected Saqqaq read depth correlates equally well with the 
experimental data sets and at even higher levels with the compu- 
tational predictions (PCC = 0.77 and PCC = 0.74; P < 1 x 10~ 16 ). 

Transcription start sites (TSSs), generally impose stronger 
positioning of nucleosomes in their vicinity, though the posi- 
tioning depends on expression level and tissue type (Valouev et al. 
2011). Across these regions, Saqqaq correlates less strongly, albeit 
still positively, with both the computational predictions (median 
PCC of 0.25; P < 1 x 10~ 16 ) and to the experimental maps (median 
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Figure 1. Paleo-Eskimo read depth reflects nucleosome occupancy. (A, left) Regional variation in read depth relative to genomic average (enrichment) 
for Saqqaq, Control, Aboriginal, and an experimental occupancy map ('Schones') (Schones et al. 2008). (Right) Saqqaq and Control regional read-depth 
variation after GC-correction. (8) Read-depth variation in a centromeric region known to harbor a 200-kb array of well-positioned nucleosomes (Gaffney 
et al. 201 2) (left) and a region with genes (right). CpG islands (green bars) correlate with elevated read depth in the Saqqaq. The variation is also observed 
in genomically unique regions (black bars), where reads down to length 25 can map. The read depth of the Control exhibits lower variance. (C) Examples 
of Saqqaq read-depth variation, GC corrected read-depth variation, Saqqaq nucleosome predictions, and experimental (Schones, from CD4+ cells) as well 
as computational (Dennis and A375) (Dennis et al. 2007; Ozsolak et al. 2007) occupancy maps in ~2-kb regions of the nucleosome array (left) and 
a transcription start site (TSS) region (right). Light gray denotes the 147-bp-long nucleosome predictions. Saqqaq read depth correlates with both the read 
depth of the ancient Aboriginal genome and the occupancy maps, but not with the Control. (D) DNA packaged around nucleosomes. We hypothesize that 
DNA wound around nucleosomes to be better protected from degradation. (£) The Saqqaq shows more variation in read depth than Control, with more 
genomic sites showing extremely low or high read depth. (F) Distribution of correlations for Saqqaq versus other sets across all promoter regions. 



PCC of 0.02 and 0.07; P < 1 x 10~ 16 ). In contrast, the two MNase- 
based maps show a comparable level of correlation to the nucle- 
osome array regions (PCC = 0.35; P< 1 x 10~ 16 ) (Supplemental Fig. 
S2.1c), consistent with the similarity of their originating cells. 
Again, the uncorrected Saqqaq read depth correlates more strongly 
to both the computational predictions (median PCC = 0.79 
and PCC = 0.76; P < 1 x 10~ 16 ) (Supplemental Fig. S2.1b) as 
well as to the MNase-based maps (median PCC = 0.17 and 0.06; 
P < 1 x 10~ 16 ), which is not unexpected given the importance of 
GC-rich sequence signals in determining nucleosome positions 
and the absence of GC-correction for the MNase-based data sets 
(Collings et al. 2010; Valouev et al. 2011). 

At actively transcribed genes, the region upstream of the TSS 
is depleted to facilitate access to the transcriptional machinery, 



while the downstream nucleosomes are strongly phased with high 
occupancy at the +1 position (Schones et al. 2008; Valouev et al. 
201 1). The read-depth profile across TSS regions closely recapitulated 
this pattern (Fig. 2A). Consistently, the accumulated read depth 
also matched the known occupancy patterns for splice sites (Sup- 
plemental Fig. S2.2), in agreement with our nucleosome protection 
hypothesis. 

We subsequently tested whether we could find evidence in 
the Saqqaq data set for the ~200-bp read-depth periodicity char- 
acteristic of nucleosome protection (147 bp) and linker region 
(50 bp) cleavage (Fig. ID). Short-time Fourier transformation 
revealed a strong signal at 200 bp at TSS regions, where nucle- 
osomes are strongly phased, and downstream (Fig. 2A). Using 
Fourier transform periodograms (Welch method), we found the 
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Figure 2. Read depth and fragment length periodicity. (A) Read-depth variation at TSS. Spectrogram 
around TSS (top) showing the strength of the periodicity signal at different wavelengths. Nucleosome 
abundance (bottom) summed over aligned transcription start sites. High occupancy at the +1 nucleo- 
some position is characteristic of transcriptional activity. (8) Spectral density (periodogram) for TSS 
regions. The frequency spectrum shows a peak in relative signal at 193 bp corresponding to the 
expected inter-nucleosome distance. (C) 5' read-end phasograms showing the distribution of distances 
between reads in gene bodies. A clear ~200-bp periodicity is apparent, consistent with the presence of 
nucleosomes (right). A short-range periodicity of ~1 0 bp is also apparent (left), corresponding to a turn 
of the DNA helix as it winds around the nucleosome. (D) Distribution of fragment sizes from ancient 
samples of horse (top), polar bear (middle), and Saqqaq (bottom) are consistent with preferential 
cleavage of exposed nucleosome-wrapped DNA strands every 10 bp. 



overall peak periodicity to be 193 bp for TSS regions (Fig. 2B). 
Similarly, strong signals around 200 bp were also observed for 
other regions, including CpG islands, gene bodies, and sequences 
known to be bound by CTCF (Supplemental Figs. S2.3-S2.6). Nu- 
cleosome protection and extensive cleavage of linker regions 
would also predict NGS reads to predominantly start at the edge of 
nucleosomes. This bias should result in a periodicity of 5' read 
starts proportional to the nucleosome inter-distance. Phasograms, 
illustrating the distance between 5' read starts (Valouev et al. 
2011), revealed the presence of periodicity in gene bodies at the 
expected length of —200 bp (Fig. 2C; Supplemental Fig. S2.8). 

Interestingly, phasograms also revealed a short-range peri- 
odicity of 10 bp, coinciding with the length of a turn of the DNA 
helix. This could reflect preferential shifts in nucleosome posi- 
tioning every 10 bp (Brogaard et al. 2012) and/or preferential 
cleavage of the DNA backbone facing away from nucleosome 
protection. The size distribution of full-length Saqqaq aDNA 
fragments (Supplemental Material SI1.7), which shows three peaks 
at 38, 48, and 58 bp (Fig. 2D; Supplemental Fig. Sl.l) and an inter- 
peak distance mirroring both the phasograms and the length of 
one DNA helix turn, could be indicative of preferential cleavage of 
the DNA backbone facing away from nucleosome protection. 
However, size distribution profiles could also be affected by other 
factors, in particular, the gel size selection performed on libraries 
post-amplification. We therefore investigated the fragment length 
distribution of libraries prepared without gel selection; first, from 



a permafrost-preserved 4400-yr-old horse 
bone and second, from a much older 
(110,000- to 130,000-yr-old) polar bear 
bone (Fig. 2D; Supplemental Material 
SI1.7; Miller et al. 2012). This revealed 
a striking pattern, with fragment abun- 
dance increasing by up to a factor of two 
across periods of 10.3 bp until fragment 
sizes compatible with nucleosomal length 
(147 bp) are reached. This ruled out size 
selection as a possible driving factor and 
supported protection of DNA strands fac- 
ing the nucleosome. It also suggests that 
nucleosome protection footprints could 
be present in tissues other than hair and 
survive over extremely long periods. In- 
terestingly, the ~10-bp periodicity was 
also observed in modern hair but not in 
modern blood sequencing data (Supple- 
mental Fig. Sl.l). This is in line with ap- 
optosis taking part in the hair differentia- 
tion process (Botchkavera et al. 2006) and 
the known short-range size periodicity in 
DNA fragmentation following apoptosis 
(Aruscavage et al. 2010). 

Finally, we asked whether the read- 
depth signal was consistent between an- 
cient hair samples by taking advantage 
of the ~100-yr-old Aboriginal Australian 
genome sequence data set (Rasmussen 
et al. 2011). The Aboriginal read depth 
clearly showed the same periodic pattern 
of variation as Saqqaq (Fig. 1C), with 
a strikingly high degree of correlation in 
unique regions unaffected by mapping 
issues (PCC = 0.87; P < 1 x 10~ 16 ), even 
though a different procedure was used for preparing DNA libraries 
(Supplemental Material SI1.3). Consistent regional variation and 
a 200-bp periodic signal were also observed in DNA reads from 
modern hair (Supplemental Table S2.1; Supplemental Fig. S2.7), 
suggesting again that DNase-dependent cleavage during apoptotic 
processes inherent to hair differentiation, rather than post-mortem 
degradation, drives the nucleosomal protection pattern observed in 
ancient samples. The fragment-length signal in the bone-derived 
samples horn horse and polar bear, however, shows that some nu- 
cleosome protection signal may also stem from necrotic tissues, in 
line with previous findings (Dong et al. 1997). 

Nucleosome map and positioning patterns 

As the variations in read depth observed in the Saqqaq sequence 
data could not be explained by experimental artifacts or alignment 
biases, but matched with all patterns reflecting nucleosome pro- 
tection, we defined a genome-wide human nucleosome map for 
the Saqqaq based on GC-corrected read-depth variations. Peaks of 
the GC-corrected read depth were called as nucleosome centers 
when showing a maximal positive value within a nucleosome- 
wide window (147 bp). Each nucleosome call was scored by the 
difference in read depth between the peak and flanking (linker) 
regions in order to capture information about both occupancy 
(peak read depth) and positioning (depletion in linker regions) 
(Fig. 3A; Supplemental Material SI2.7). We then compared the 
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Figure 3. Nucleosome calls and positioning patterns. (A) Nucleosome center positions (dyads) are called as read-depth peaks if maximal at the center of 
a running window of nucleosome length 1 47 bp. Calls are scored by the difference in read depth between the peak (p) and the average read depth of the 
left (If) and right (rf) flanking regions [score = p - (If + rf)/2]. (8) Nucleosome call abundance is shown as a function of quality score cutoff for the Saqqaq 
(blue) and the Control (red), which lacks the nucleosome signal. The difference (green) gives the expected number of true positive calls at a given score 
cutoff and, indirectly, the FDR (<1 % for the 1 .9M calls with a score cutoff >29). (C) Base composition and distribution of purine/pyrimidine sequence 
dimers across the top 25% of called nucleosomes. 



Saqqaq scores to the Control scores and estimated the false dis- 
covery rate (FDR) by assuming that calls in the Control are all false 
(Fig. 3B; Supplemental Material SI2.8). We found that the 25% top- 
scoring calls (2.66M; score > 22.5), spanning 13% of the genome, 
have an FDR of 5% (Supplemental Fig. S2.10). 

DNA sequence determinants of nucleosome positioning are 
well-described in model organisms (Collings et al. 2010; Ioschikhes 
et al. 2011; Brogaard et al. 2012; Struhl and Segal 2013) such as yeast 
and Caenorhabditis elegans, but are only beginning to be deciphered 
in human (Kogan et al. 2006; Valouev et al. 2011; Gaffney et al. 
2012). We took advantage of our Saqqaq nucleosome map to gain 
insights into the sequence patterns that drive nucleosome posi- 
tioning in the human genome (Supplemental Material SI2.9). We 
first looked at the single nucleotide distribution across the top 
quartile of nucleosome calls in unique regions (n = 1.37M, FDR = 
8.5%). Base frequencies show the expected 10.2-bp periodicity cor- 
responding to the turn of the slightly stretched DNA helix around 
nucleosomes (Brogaard et al. 2012), with phases offset and pairwise 
opposite of each other (C vs. A and G vs. T) (Fig. 3C; Supplemental 
Fig. S2.ll). This pattern is reverse-complemented and symmetric 
across the center position, which is located at the nucleosome dyad, 
as double-stranded DNA can match in either direction. Strikingly, 
the G and C frequencies vary dramatically across the center, with 
a spike of 35% G at position -2 (+2 for C) and a drop to 14% at 
position +1 (-1 for C). Such localized changes in nucleotide fre- 
quencies are only possible if the nucleosome sequences are accu- 
rately aligned, suggesting nucleotide-level precision for a large frac- 
tion of the calls, as recently obtained for yeast (Brogaard et al. 2012). 



We next focused on the distribution of dinucleotides, which 
is known to influence nucleosome positioning (Ioschikhes et al. 
2011; Struhl and Segal 2013). Interestingly, the characteristic 
strong/weak dinucleotide pattern found in yeast (Segal et al. 2006) 
is absent in Saqqaq nucleosome calls (Supplemental Fig. S2.12), in 
line with recent human studies (Ioschikhes et al. 2011; Valouev 
et al. 2011; Gaffney et al. 2012). Instead, we detect dramatic 10-bp 
periodic variations in pyrimidine vs. purine dinucleotide fre- 
quencies, with increasing amplitudes toward the center, where 
they are reverse-complemented (Fig. 3C). This pattern has pre- 
viously been observed with a limited human nucleosome sequence 
set (Kogan et al. 2006), and is confirmed here with unprecedented 
precision and at the genome-wide scale. 

These patterns are also recovered when analyzing nucleo- 
some calls based on the uncorrected Saqqaq read depth, in which 
case the GC enrichment at the nucleosome center becomes more 
pronounced (Supplemental Material SI2.9; Supplemental Fig. 
S2.13). They likely reflect the thermodynamically optimal pack- 
aging of DNA around nucleosomes and thereby participate in de- 
fining the equilibrium location of the nucleosome (Ioschikhes 
et al. 2011). Sliding DNA away from the center would have accu- 
mulated effects across the nucleosome on the match to (1) the 
strong positional preferences near the dyad, (2) the 10-bp periodic 
variation in dinucleotide frequencies (Struhl and Segal 2013), (3) 
the increasing amplitude of this variation, and (4) the highly 
strand-specific signal, which is reverse-complemented at the cen- 
ter position. Additionally, the sequence patterns observed here 
predict that nucleosomes should be repelled from linker regions, 
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where strand specificity must switch 
again when approaching the next nucle- 
osome. We conclude that the combined 
effect of these individually weak compo- 
sitional biases may determine a precise, 
thermodynamically optimal positioning 
of individual nucleosomes. 

Cytosine methylation signal 

We next focused on detecting cytosine 
methylation at CpG dinucleotides, as 
experimental evidence supports the long- 
term survival of methylated cytosines 
(5mC) in aDNA fragments (Briggs et al. 
2010; Llamas et al. 2012). Genome-wide 
methylation maps of the human genome 
have already been reconstructed using 
high-throughput sequencing in combi- 
nation with a variety of approaches. In 
the case of bisulfite sequencing (Krueger 
et al. 2012), unmethylated cytosines are 
chemically converted into uracils, gener- 
ating cytosine to thymine (C— >T) muta- 
tions that can be located in the human 
genome at the single-nucleotide level. 

Similar chemical conversions are 
known to occur naturally post-mortem, 
through the hydrolytic deamination of 
cytosines into uracils (Hofreiter et al. 
2001). Some Taq DNA polymerases such 
as Platinum Taq DNA Polymerase High 
Fidelity (HiFi) can replicate through ura- 
cils, copying them as native thymine 
residues and thereby introducing C^T 
misincorporations in the pool of mole- 
cules amplified. Such misincorporation 
rates increase toward sequence starts 
where deamination rates are inflated due 
to the presence of single-stranded over- 
hangs (Briggs et al. 2007). However, with Pfu DNA polymerases 
that cannot bypass uracils (e.g., Phusion), such misincorpora- 
tions should vanish (Rasmussen et al. 2010) except at 5mC sites 
where post-mortem deamination transforms 5mC residues into 
thymines, which represents a native template for all DNA poly- 
merases. We therefore hypothesized that C^T misincorporation 
events observed in the Saqqaq sequence reads generated following 
library amplification with Phusion could be used to track ancient 
5mC residues and reveal genome-wide information about ancient 
DNA methylation levels (Supplemental Fig. S3.1; Supplemental 
Material SI3.1). 

We first tested this prediction by analyzing mismatch pat- 
terns in reads starting at CpGs, which are the main targets of 
methylation in mammals (Fig. 4A; Lister and Ecker 2009). Fo- 
cusing on the first position where deamination rates are maxi- 
mal (Briggs et al. 2007), we observed a 5.04-fold increase in C^T 
errors for Phusion reads starting at CpG. Reads sequenced with 
HiFi also showed an increase, albeit smaller (1.74-fold), sug- 
gesting a higher fidelity for T than U. This pattern was absent (1) 
for Phusion reads starting with CpA, CpT, and CpC (Fig. 4B), (2) 
at inner sequence positions known to be affected by lower post- 
mortem deamination rates (Supplemental Fig. S3. 2), and (3) 
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Figure 4. Substitution rates at CpG reveal methylation of DNA. (A) C^T mismatch rates (gray) versus 
rate of other mismatches (black) between a random subset of 1,000,000 Phusion (left) or HiFi (right) 
reads mapping uniquely. Reads are split by those starting with CpG (top; 26,864 Phusion and 25,568 
HiFi reads) and other dinucleotides (bottom). (8) Mismatch frequencies for Phusion (left) and HiFi (right) 
for reads aligned to various genomic locations starting with the dinucleotides: CpG (top) and Cp[ACT] 
(bottom). (C) Distribution of M s values for three classes of promoters with low, medium, and high CpG 
densities (Supplemental Material SB. 2). (D) Methylation profile (M 5 , top) and read depth variation 
(bottom) at CTCF regions. Read depth provides a proxy for nucleosome occupancy. (£) Distribution of 
M s values across nucleotide positions covered with nucleosomes, showing a depletion in methylation 
levels within a core region (20 nt before and after the nucleosome center) that is particularly marked at 
the nucleosome center. 



among Phusion mitochondrial reads in agreement with the 
absence or low levels of methylation present in this genome (Rebelo 
et al. 2009). 

We next investigated whether C^T mismatch rates were 
lower in known hypomethylated regions than in hypermethylated 
regions. CGIs are well-characterized hypomethylated regions 
(Deaton and Bird 2011). We therefore compared CpG^TpG rates 
within and outside CGIs (Illingworth et al. 2010). Introns and 
exons residing outside CGIs (Fig. 4B) showed higher CpG^TpG 
mismatch frequencies (4.8% and 5.1%) than both CGIs and the 
genomic average (1.8% and 4.2%), supporting methylation as the 
source of the elevated mismatch frequencies observed at CpG sites 
in the Saqqaq Phusion sequence data. 

Our ability to detect cytosine methylation depends on the 
extent of post-mortem deamination rates and ultimately on se- 
quencing depth. The 5mCpG sites that did not experience a post- 
mortem deamination event are copied as regular CpGs, and 
therefore do not leave any methylation footprint in the Phusion 
sequence data set. Consequently, the absence of CpG^TpG mis- 
matches at a given location of the Saqqaq genome cannot be taken 
as proof that this locus was devoid of methylation. This could only 
be demonstrated with extensive depth-of-coverage, as in this sit- 
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uation the chances that at least one of the templates sequenced 
was deaminated are increased. An easy way to increase sequence 
coverage is to go beyond the level of single dinucleotides and re- 
cord CpG^TpG mismatches within a full genomic region including 
multiple CpG sites. Similarly, since post-mortem cytosine de- 
amination rates decline rapidly from sequencing ends (Briggs et al. 
2010), focusing on read starts should increase our ability to capture 
deamination events in the pool of molecules sequenced, thus, to 
detect CpG^TpG mismatches. 

We therefore defined a measure of regional methylation level 
based on the average CpG^TpG mismatch frequencies observed 
within a given region at read starts (M s ) (Supplemental Material 
SI3.2; Supplemental Fig. S3. 3) and applied this to a series of ge- 
nomic regions in order to demonstrate its ability to capture gen- 
uine methylation information (Supplemental Material SI3.3). 
CpG^TpG conversions at read ends and other positions within 
reads were disregarded given (1) the drop in sequence quality to- 
ward read ends, (2) the drop in post-mortem cytosine deamination 
from read starts, and (3) the presence of a significant fraction of 
inserts not sequenced over their full length. Interestingly, at pro- 
moter regions, we recovered the pre- 
viously observed negative correlation be- 
tween CpG site density and methylation 
levels (Fig. 4C; Ball et al. 2009). M s also 
reproduced the expected methylation 
pattern at CGI promoters showing sig- 
nificant strand asymmetry in the distri- 
bution of guanines and cytosines (Ginno 
et al. 2012), with increasing methylation 
levels from the TSS toward the 2 kb lo- 
cated upstream and downstream (Sup- 
plemental Fig. S3.4). In addition, we found 
a 7.4-fold reduction in M s at CGIs pre- 
dicted to be undermethylated compared 
to ubiquitously methylated CGIs (Sup- 
plemental Fig. S3. 5; Straussman et al. 
2009). We also observed the expected re- 
duction in methylation levels between 
exons and introns across splice sites (Sup- 
plemental Fig. S2.2; Laurent et al. 2010), 
with 5' splice sites showing average 
methylation levels higher than at 3' splice 
sites. Finally, we compared our methyla- 
tion estimates to experimental methyla- 
tion measurements gathered across a vari- 
ety of modern individuals and somatic 
tissues (Supplemental Material SI3.4; 
Slieker et al. 2013). For all tissues and 
individuals investigated, we found sig- 
nificant and high correlations between 
normalized modern methylation levels 
and regional M s values for regions span- 
ning 750 bp or 1000 bp around the 
CpGs from the Illumina 450K array (Sup- 
plemental Table S3.1, S3.2). Adjusted 
R-squareds were, however, found maximal 
with hair methylation levels (adjusted 
R-squared = 0.620-0.785 depending on the 
coverage threshold considered). Selecting 
CpGs from the Illumina 450K array with 
at least a twofold average difference in 
methylation levels between hair, blood, 



buccal, and saliva, we found greater M s values for Saqqaq at CpGs 
showing higher average methylation in modern hairs and lower M s 
values at CpGs showing higher methylation in other tissues (Sup- 
plemental Figs. S3. 6, S3. 7). Finally, we used this set of CpG and M s 
values calculated on the Saqqaq data to perform unsupervised hi- 
erarchical clustering based on individual methylation profiles and 
found that the Saqqaq grouped together with modern hair tissues 
(Fig. 5; Supplemental Figs. S3.9-S3.13). This holds true regardless 
of the coverage threshold implemented during data filtering, 
suggesting that our approach is largely robust to regional variation 
in sequence coverage (Supplemental Material SI3.4). Overall, this 
supports the validity of M s and the methylation information re- 
covered from the Saqqaq Phusion sequence data. 

Correlation of nucleosome and methylation signals 

We next compared our nucleosome calls and methylation proxy at 
CTCF binding sites. CTCF-bound sites provide anchor points for 
arrays of well-positioned nucleosomes stretching over ~4 kb in the 
human genome (Fu et al. 2008) and play a key role in the regula- 
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Figure S. Unsupervised hierarchical clustering of tissue methylation profiles. M s -based methylation 
levels of the Saqqaq individual are compared to the methylation profiles of five modern donors (PT1 , 
PT2, PT3, PT4, and PT5) across four tissues (blood, buccal, saliva, and hair). M s calculations were based 
on 2000-bp-wide genomic regions centered on each locus from the Illumina 450k array, disregarding 
those that showed less than 100 CpG sites at read starts (Supplemental Material SI3.4). The final set 
includes a total number of 7383 CpG sites. 
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tion of gene expression (Bell et al. 2001) and in shaping the 3D 
structure of chromosomes (Handoko et al. 201 1). Importantly, our 
map displays the characteristics of in vivo nucleosome positioning 
around CTCF sites (Valouev et al. 2011), indicating that nucleo- 
somes do not revert to a more in vitro-like positioning post-mortem 
(Fig. 4D). Welch's FFT analysis (Supplemental Fig. S2.5) showed 
a sharp 182-bp spacing signal in CTCF flanking regions, consistent 
with the ~185-bp spacing reported by Fu and colleagues (Fu et al. 
2008). Strikingly, we found nucleosome calls out-of -phase with M s 
(Fig. 4D), in agreement with recent surveys reporting that nucleo- 
some occupancy and accessibility to GpC methyltransferase are 
anti-correlated in human IMR90 cells (Kelly et al. 2012). Signifi- 
cantly negative correlations were found following M s calculation on 
sequence data sets down-sampled to even depth, suggesting that 
this analysis was not affected by the greater power achieved to detect 
methylation at positions with greater read depth (Supplemental 
Table S3. 7). Minimal M s values at anchor points (<0.024) provided 
a direct measure of 63 bp for the footprint of the insulator CTCF 
protein that cannot bind to methylated DNA. This is consistent 
with the range of 32-64 nt recovered from direct DNAse treatment 
(Fu et al. 2008). Our findings add hair shafts to the empirical evi- 
dence available for characterizing the features of chromatin orga- 
nization at CTCF flanking regions that is so far limited to a number 
of cell lines in humans. 

Based on the strong consistency between our nucleosome and 
methylation data at CTCF regions, we decided to survey methyl- 
ation variations across nucleosomes interspersed across the whole 
genome. We evaluated M s positionally across our top quartile of 
nucleosome calls from unique regions. Methylation levels were 
found to vary dramatically across nucleosomes, with a depletion 
(position -20 to +20) and a sharp drop at the center position 
(Fig. 4E). Intriguingly, the zone depleted for methylation is 
enriched in CpG dinucleotides (and strong dinucleotides in gen- 
eral, CC, CG, GC, GG; position -30 to +30), suggesting that the 
presence of nucleosomes protects DNA from methylation in vivo, 
in agreement with recent experimental results in HeLa cells (Felle 
et al. 201 1). Additionally, the drop in methylation levels could ease 
steric constraints while wrapping DNA around the nucleosome, 
consistent with the known reduction in CpG deformability fol- 
lowing methylation (Perez et al. 2012) and the relatively increased 
difficulty to assemble nucleosomes in vitro on methylated tem- 
plates compared to methylation-free templates (Buttinelli et al. 
1998). 

Gene expression inference 

Since gene expression is influenced by epigenetic marks, we rea- 
soned that our data could indirectly re- 
veal information about transcriptional 
regulation in ancient cells. DNA methyl- 
ation is often, but not always, associated 
with gene silencing. Methylation at the 
first exon within gene bodies hinders 
further elongation by the transcriptional 
machinery and is tightly linked to gene 
expression down-regulation (Brenet et al. 
2011). We therefore ranked all hgl8 gene 
annotations showing sufficient sequence 
coverage (Supplemental Material SI3.5) 
according to M s values at the first exon 
(Supplemental Table S3. 6). This provided 
a list of candidate accessions with low, if 



Methylation ratio 



any, expression levels. Following Ball et al. (2009), we further cal- 
culated the ratio R s of gene body to promoter methylation as 
a proxy for gene expression, with low (high) R s values indicating 
low (high) expression levels. The vast majority of genes showing 
highest M s values at the first exon was found in the first quartile of 
R s values (100.0% for the genes with top-99% M s values at the first 
exon, and 88.3% for top-95%) (Supplemental Material SI4), sug- 
gesting strong consistency across both methylation-based expres- 
sion proxies. We next recovered gene accessions for a set of pro- 
teins known to be expressed in hair shafts (Lee et al. 2006) 
and found that they represented a group of transcripts with greater 
R s values than the overall distribution of all genes annotated 
(Kolmogorov-Smirnov test, P- value = 0.00152) (Supplemental 
Material SI4). This is in line with the latter being a mixture of 
expressed and silenced genes. 

High R s values predicted expression for a range of keratin 
transcripts, including keratins 71 and 85 (Supplemental Tables 
S4.1-S4.3). Keratins 71 and 85 are known components of the inner 
root sheath of hair follicles and hair shaft cortex and medulla, re- 
spectively (Moll et al. 2008; Langbein et al. 2010). We also found 
low R s values, suggesting low expression, if any, for nonhair- 
specific keratins, such as keratin 79 (Supplemental Tables S4.1- 
S4.2). Moderate to high R s values also confirmed the presence of 
a number of proteins involved in cellular adhesion and cytoskeleton 
organization, including plakophilin 1, plakophilin 3, desmoplakin, 
periplakin, and plectin, in agreement with the importance of des- 
mosomes, hemi-desmosomes, and/or adherens junctions in hair bi- 
ology (Bazzi et al. 2009). We also predicted high levels of trichohyalin 
(TCHH), a protein known to confer mechanical strength to the hair 
follicle inner root sheath. Overall, our R s predictions are in line 
with the biology of hair formation. 

As a further validation of our expression predictions, we used 
expression data from modern hair follicles (Supplemental Material 
SI4) to rank genes into 10 groups of increasing expression, looking 
for correlation with R s . We found a significant positive correlation 
with the expression groups, indicating that our approach can 
provide information about the transcriptional state of ancient cells 
(Fig. 6A). We then selected two additional measures known to 
correlate with expression, and calculated those using the Saqqaq 
data: (1) the presence of a strongly positioned +1 nucleosome 
(Fig. 2A; Valouev et al. 2011); and (2) the level of downstream 
regularly spaced phasing of nucleosomes (Schones et al. 2008). 
Those measures also showed significant positive correlation with 
the expression groups (Supplemental Material SI4; Fig. 6B,C). 
When extending to a more fine-grained division of expression, 
a weaker, albeit more significant correlation was observed, as 
both the measure and ranking become more sensitive to noise. 
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Figure 6. Nucleosome and methylation maps as proxies for ancient gene expression. Relationship 
between three measures assessing gene expression. (A) Methylation ratio (R s ), a measure of methylation 
in promoter versus gene bodies. (8) First nucleosome occupancy, average read depth over the TSS +1 
nucleosome region. (C) Phasing strength, a measure of strength of the periodicity between neighboring 
nucleosomes across the TSS region by Fourier transform analysis. All display a significant correlation with 
expression as measured by microarrays in modern hair follicles. 



Genome Research 461 



www.genome.org 



Pedersen et al. 



We further selected the top 1000 genes showing maximal R s values 
and the top 1000 genes showing minimal R s values as candidates of 
genes with high and low levels of expression in the Saqqaq hairs. 
Functional enrichment analyses (Huang et al. 2009) of those 
down-regulated candidates revealed chymotrypsin and trypsin- 
like enzymes, homeobox, and genes involved in signalization, cell 
adhesion, ionic channels, glycoproteins, muscle proteins, and 
proteins integral to plasma membrane. Up-regulated candidates 
were enriched in categories involving the ubiquitin ligase com- 
plex, phosphorus metabolic process, ligase, inorganic anion trans- 
port, and genes associated with metal-binding activities (Supple- 
mental Material SI4). 

Prediction of the age at death 

Aging is increasingly recognized as a developmentally regulated 
program involving epigenetic modifications at different stages of 
our lives (Boyd-Kirkup et al. 2013). A number of CpG sites in the 
human genome have been shown to undergo age-associated 
changes in methylation (Alisch et al. 2012), and linear models 
relating age and methylation levels have even been described 
(Koch and Wagner 2011; Johansson et al. 2013). Such models 
provide a unique opportunity to predict the age of a given in- 
dividual based on cytosine methylation levels, although with 
limited precision. We used the methylation information recovered 
from the Phusion sequence data at several CpG sites undergoing 
age-associated changes to propose an age at death for the Saqqaq 
individual. Following the framework presented by Koch and 
Wagner (201 1), we focused on four particular CpGs for which age- 
methylation linear models have been established across a variety 
of tissues and cell types (Supplemental Material SI3.6). As the latter 
did not include hair samples, we first tested whether such models 
could accurately predict the age of five living donors based on their 
hair methylation levels (Slieker et al. 2013). Two CpGs (cg23571857 
and cg25 148589) showed large differences between predicted and 
real age (SD = 9.4 and 12.4 yr, respectively) and were disregarded 
(Supplemental Fig. S3. 14). However, two other CpGs (cg07533148 
and cg01530101) provided reliable age estimates, with differences 
between predicted and real ages within 1.7-12.4 yr, in agreement 
with the error margin originally reported for such approaches 
(Supplemental Fig. S3. 14; Supplemental Material SI3.6; Koch and 
Wagner 2011). We therefore used those loci to predict the age at 
death of the Saqqaq individual. Estimating M s for a 2000-bp-wide 
region centered on each CpG from the Illumina 450k array, we built 
a linear model relating M s and the methylation levels measured in 
hairs of five modern donors (adjusted R-squared = 0.620-0.785) 
(Supplemental Table S3.1) in order to convert M s into absolute 
methylation levels for Saqqaq at the two loci of interest (Supple- 
mental Material SI3.6). Following Koch and Wagner (2011), the 
absolute levels were used to infer age. Both CpGs considered pro- 
vided strikingly similar age estimates ranging from 44.1 to 69.3 yr 
and 52.1 to 64.1 yr, respectively (95% CI across all analyses) 
(Supplemental Fig. S3. 15). Considering the prediction error mea- 
sured in our five modern donors, this indicates that the Saqqaq 
individual was probably among the elderly when he died and was 
likely at least in his late thirties. 

Discussion 

Epigenetics complements genetics in determining the phenotypic 
state of cells and organisms (Bird 2002; Meissner et al. 2008; 
Hollister and Gaut 2009; Teschendorff et al. 2011). The extraction 



of epigenetic information from ancient samples can therefore both 
elucidate the ancient phenotypic state as well as the evolutionary 
changes of the epigenome. Yet, to our knowledge, ancient epi- 
genomic information has not been extracted genome-wide pre- 
viously, and our study is the first to report both nucleosome occu- 
pancy and methylation levels. The long-term survival of 5mC in 
ancient DNA has previously been reported using either sophisticated 
enzymatic reactions (Briggs et al. 2010) or bisulfite treatments 
(Llamas et al. 2012). In contrast, our approach does not require any 
extra treatment and simply relies on deep-sequencing following 
standard protocols. Taking advantage of a series of DNA degradation 
steps preceding cell death, it provides both genomic and epi- 
genomic information from a single sequence library. Accurate 
estimates of regional levels of methylation from high depth 
(>20x) whole-genome sequencing data of the 4000-yr-old Saqqaq 
individual (Rasmussen et al. 2010) was achieved despite low levels of 
cytosine deamination (Fig. 4A; Ginolhac et al. 2011). We anticipate 
that similar methylation profiles could be gathered at lower 
coverage in cases where DNA is affected by extensive post-mor- 
tem damage. 

Importantly, our framework can also deliver methylation in- 
formation in cases where other molecular tools than the Phusion 
are used. In particular, UNG-EndoVIII treatment of ancient DNA 
extracts prior to library construction has become a standard pro- 
cedure for limiting the extent of artifactual mutations in final 
ancient genome assemblies (Briggs et al. 2010; Bos et al. 2011; 
Meyer et al. 2012; Schuenemann et al. 2013). This method shows 
great efficiency at eliminating post-mortem deamination by- 
products at regular cytosines (i.e., uracils) (Briggs et al. 2010) but 
cannot detect deaminated 5mC, as the latter are nothing but reg- 
ular thymines (Supplemental Fig. S3.1). Therefore, similar to our 
use of Phusion, CpG-+TpG misincorporations observed at read 
starts following UNG-EndoVIII treatment offer an opportunity to 
track ancient CpG methylation footprints. One proximate per- 
spective from our work is, therefore, to apply our framework to the 
high-quality Neandertal and Denisovan genomes where such 
UNG-EndoVIII treatments have been used (Meyer et al. 2012) in 
order to contrast genome-wide ancient bone methylation profiles 
in archaic hominins and modern humans. 

Our ancient nucleosome map is of similar accuracy to modern 
MNase-based maps, but is unlikely to be subject to the same set 
of sequence biases that result from MNase cutting preferences 
(Chung et al. 2010). In our case, the fragmentation happens post- 
mortem, likely by a combination of cleavage by endogenous DNases 
and spontaneous de-purination processes (Dong et al. 1997; 
Botchkavera et al. 2006; Briggs et al. 2007). It therefore offers a 
unique view of nucleosome occupancy. The distinct positioning 
patterns with strong nucleotide preferences for individual positions 
have not been reported genome-wide for humans before and show 
that a large fraction of the nucleosome calls are at nucleotide-level 
resolution. Their strand-specific, oscillating nature coupled with 
reverse complementation at the dyad show how optimal nucleo- 
some positioning may be specified by the accumulated effect of 
individually weak compositional biases. In particular the strand- 
specific purine versus pyrimidine patterns, which are reverse- 
complemented at the dyad, would contribute strongly to this. 
Furthermore, the observed hypomethylation of DNA spanning 
nucleosome cores might also partake in nucleosome positioning 
by increasing DNA deformability (Buttinelli et al. 1998; Perez 
et al. 2012). All in all, our findings illustrate how aDNA can offer 
an original source of information to advance our understanding 
of nucleosome biology. 
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Our results show that both methylation tracts and nucleo- 
some occupancy patterns can be preserved for significant time 
periods after death in both hairs and bones. How long such signals 
survive in a full range of environments and tissues remains to be 
determined; however, the sequence data underlying the ancient 
Aborigine genome demonstrate that genome-wide epigenetic in- 
formation could be recovered after one century in warm environ- 
ments (Rasmussen et al. 201 1). This age limit is extended by at least 
three orders of magnitude, to over 100,000 yr, in cold environ- 
ments, as revealed by the polar bear shotgun sequence data (Miller 
et al. 2012). This time range opens the possibility to track major 
epigenomic modifications at different time scales: over a few 
generations, by following major changes in diet and epidemics 
using medical archives, potentially including the extensive col- 
lections of formalin-fixed and paraffin-embedded biopsies (Kerick 
et al. 2011); but also over thousands of years by following epi- 
genomic changes over long-term environmental changes such as 
those from the Last Glacial Maximum. Epigenomic analysis of 
ancient DNA therefore paves the way for charting shifts in the fre- 
quency of epialleles over time, providing a direct way of detecting 
epigenetic adaptations to environmental conditions in analogy to 
how positive selection is detected with genomic data. In addition to 
investigating the full spectrum of possible changes driving adapta- 
tion of human populations to their environment, these data will 
more generally contribute to evaluating the potential of epigenetic 
modifications, besides mutations, as a major evolutionary force. 

Interestingly, the Saqqaq methylation profile was found to 
cluster together with modern hair shafts to the exclusion of other 
tissue types (Fig. 5; Supplemental Figs. S3.9-S3.13). Such infor- 
mation can be used to demonstrate the absence of major con- 
tamination from other sources, and therefore can prove essential 
for authenticating ancient human sequence data sets in a number 
of cases. One important authentication criterion when working on 
ancient humans is the ability to demonstrate that the ancient ge- 
netic signature does not match that of any of the coworkers, from 
field archeologists and anthropologists to molecular biologists 
performing ancient DNA analyses in the lab (Gilbert et al. 2005). 
Yet, such analyses are not always possible (e.g., in cases where 
archeological remains were discovered a long time ago and where 
not all persons who have been in direct contact with the material 
could be tracked). In such cases, methylation-based clustering of 
the ancient specimen with expected profiles of the fossil material 
(mostly bones, teeth, and/or hair) can provide evidence supporting 
the absence of contamination, as the latter will most likely origi- 
nate from different types of tissues, such as skin, saliva, and pos- 
sibly blood. This information can be lined up with post-mortem 
DNA damage signatures (Krause et al. 2010) to further demonstrate 
the absence of contamination by fresh DNA material. In cases 
where fossil specimens and coworkers originate from the same 
geographical region (e.g., ancient European specimens studied by 
European researchers), SNP variation will likely support that the 
ancient specimen could belong to the same population back- 
ground as the coworkers, leaving again epigenetic signatures as an 
invaluable complement to the analysis of DNA damage patterns in 
the final authentication. It is noteworthy that particular sample 
decontamination procedures such as bleaching prior to DNA 
analyses can also introduce DNA modifications mimicking bona 
fide ancient fragmentation patterns (Garcia-Garcera et al. 201 1). In 
such cases, modern contaminants cannot easily be falsified by the 
analysis of DNA damage patterns, and methylation profiles might 
reveal an important line of evidence for authenticating the se- 
quence data and confirming the absence of contamination. 



Taking advantage of age-associated changes in the methyla- 
tion levels observed at particular CpGs, we proposed an estimate 
for the age at death of the Saqqaq individual (Supplemental Ma- 
terial SI3. 6). Our approach relies on linear models available from 
the literature that relate age and the methylation levels measured 
at two loci in contemporary humans across a range of tissues. We 
estimate the accuracy of our method using five modern donors and 
predicted ages closely matching expectations within 1.7-12.4 yr 
(Supplemental Fig. S3. 14). Central to our predictions is the as- 
sumption that similar age-associated changes in methylation levels 
are at play in ancient and contemporary human populations. Yet, 
recent evolutionary changes in human diet, health conditions, and 
environment could have shifted the methylation clock, with slower 
or faster methylation rates at different CpGs in ancient and con- 
temporary populations. Testing the robustness of the age-methyla- 
tion models used in this study to a variety of environmental and 
temporal contexts will require further work. Current genome-wide 
analyses of age-associated methylation changes Qohansson et al. 
2013) will likely extend the list of loci that could be used for esti- 
mating the age at death of ancient individuals well beyond the sole 
two CpGs investigated here. Assuming a constant methylation 
clock of past and ancient populations, it is likely that including 
multiple and large numbers of loci within a single analysis will 
enable achieving better age predictions. 

Finally, our approach opens the possibility of predicting an- 
cient gene expression levels and functional interpretation. To- 
gether with ancient proteomics (Cappellini et al. 2012), it provides 
additional phenotypic information from ancient individuals, 
which can complement functional SNP genotyping. Our study 
also shows that nucleosome protection can cause dramatic varia- 
tion in read depth on a local scale across the genome, which dra- 
matically increases the amount of sequencing needed to uniformly 
call SNPs with high confidence. Taken together with the observa- 
tion that nucleosome-associated DNA fragments are more prone to 
be preserved than linker regions, this study has implications for 
designing aDNA studies, genome-wide target-enrichment pro- 
cedures (Fu et al. 2013), and forensic analyses, and for the expand- 
ing field of sequencing preserved clinical samples. 

Methods 

Sequence data sets 

Sequencing and mapping of the Saqqaq and Aborigine genomes are 
described in full detail in their respective publications (Rasmussen 
et al. 2010, 2011). Importantly, for evaluating the robustness of 
the nucleosome signal, the protocols differ in several respects, 
including the method used for constructing DNA libraries, the 
DNA polymerases used for amplifying DNA libraries (Phusion, 
Finnzymes, vs. HiFi, Life Technologies), the sequencing platform 
used (Illumina GA IIx vs. Illumina HiSeq 2000), and the mapping 
software used (SESAM vs. BWA) (see Supplemental Material SI 1.1 
and SI 1.3 for details). For the methylation analysis, Saqqaq reads 
were remapped using BWA with default parameters for refined 
adapter handling and for allowing indels (Li and Durbin 2009). 
The modern hair data set was generated following the same pro- 
cedure as Saqqaq (Supplemental Material SI 1.6). The ancient 
polar bear reads were available from the literature (Miller et al. 
2012), trimmed for adapter sequences, and mapped against de 
novo assembled scaffolds of the polar bear genome using BWA 
and standard parameters, except that the seed was disabled. The 
ancient horse sample was provided by one of us (A.T.) and se- 
quenced at the Center for GeoGenetics on a HiSeq 2000 and 
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mapped using BWA with standard parameters and disabling the 
seed (see Supplemental Material SI1.4 for details). 

The Control set was constructed from modern sequencing 
libraries to have the same number of reads and length distribution 
as the Saqqaq library. Reads were randomly sampled and truncated 
to match Saqqaq reads (Supplemental Material SI1.2). The se- 
quence data recently released for a series of modern horse genomes 
generated from fresh blood (Orlando et al. 2013) was also used to 
investigate possible short-range periodicities in the size distribu- 
tion of library inserts. Those analyses were based on full-length 
DNA inserts that were obtained by collapsing overlapping paired- 
end reads before mapping (Supplemental Material SI 1.7). 

GC correction 

GCcorrect (Benjamini and Speed 2012) was used to calculate the 
association of GC-content with read depth and, in turn, to esti- 
mate the expected read depth for all unique positions over all 
relevant read lengths in the human genome. The observed read 
depth at a given position was normalized for GC content effects by 
subtracting the expected read depth, summed across all read 
lengths, to yield the final GC-corrected read depth (Supplemental 
Material SI2.2). 



Anchor site analysis 

Nucleosome occupancy, represented by the GC-corrected read 
depth, was plotted around multiple anchor points in the genome: 
UCSC TSSs, CTCF sites (Supplemental Material SI3.3), and UCSC 
splice sites (Supplemental Material SI2.4). All instances of each 
group of anchor points were aligned and the mean GC-corrected 
read-depth of the surrounding regions was plotted. 

Read-depth periodicity 

Spectral density plots (periodograms) across CpG islands, TSSs ± 
1000 bp, gene bodies, and CTCF sites ± 1000 bp were made using 
Fourier transform (Welch's method). To remove low-frequency 
variations and constant offsets, we subtracted the background 
signal estimated by exponential curve modeling. Short-time 
Fourier transform was used to make spectrograms of the spec- 
tral decomposition across anchor sites (Supplemental Material 
SI2.5). 

Gene body (UCSC Genes) phasograms (Valouev et al. 2011) 
were produced using raw uncorrected reads and counting the dis- 
tance between pairs of 5' ends on the same strand at positions with 
at least five reads. Background signal caused by local variation in 
read depth was subtracted using exponential curve modeling 
(Supplemental Fig. S2.5a). Modes of the autocorrelation were used 
to infer dominant long-range and short-range phasing. 

Nucleosome calls 

Nucleosomes were called using a sliding window of 147 bp over 
regions with a positive GC-corrected read depth. A given position 
was called as the center of a nucleosome when showing the highest 
GC-corrected read depth within a 147-bp window centered on that 
position (Supplemental Fig. S2.9). A score incorporating both oc- 
cupancy and positioning was then calculated for each nucleo- 
some defined as the read depth over the peak (occupancy) minus 
the mean read depth of the flanking regions (positioning) (see 
Supplemental Fig. S2.9). Repeating the same procedure over the 
Control set enabled us to calculate the FDR for any given score 
threshold. 



Nucleotide patterns across nucleosomes 

Mono- and dinucleotide distributions across nucleosomes were 
produced by ranking and stratifying nucleosome calls by score. All 
nucleosomes within each group were aligned at the center (dyad) 
position, and the average usage across each position was calculated 
(see Supplemental Material SI2.9 for details). 

Methylation signal 

Nucleotide misincorporations were identified using mapDamage 
(Ginolhac et al. 2011) over genomic regions for reads produced 
using either HiFi or Phusion DNA polymerases. While both poly- 
merases are capable of bypassing deaminated methylated cytosine 
residues (thymine), only HiFi is able to bypass deaminated 
unmethylated cytosines (uracil), providing a distinct signature for 
the methylation state of cytosines in the genome. We defined 
a proxy for regional methylation levels, M s as the fraction of CpG 
dinucleotides giving rise to TpG misincorporations at read starts 
(Supplemental Fig. S3. 3; see Supplemental Material SI3.1 for a de- 
tailed description). M s was measured in a variety of genomic re- 
gions, including splice sites (Supplemental Fig. S2.2); promoters 
stratified across three classes of CpG density (Fig. 4C,E); CGIs 
showing significant strand asymmetry in the distribution of gua- 
nines and cytosines immediately downstream from their TSS (GC 
skew) (Supplemental Fig. S3. 4); undermethylated and ubiquitously 
methylated CGIs (Supplemental Fig. S3.5); and 1500- and/or 2000- 
bp-wide regions centered on each CpG site from the Illumina 450k 
array (Supplemental Material SI3.4). Unsupervised hierarchical 
clustering was performed using normalized methylation data from 
Slieker et al. (2013) and M s -based methylation estimates derived 
from linear models relating M s and the methylation levels ob- 
served at each CpG from the Illumina 450k array on five modern 
human donors (Supplemental Material SI3.4). 

The CTCF nucleosomes patterns are derived from a set of 
12,864 published CTCF binding sites (Fu et al. 2008). We calculated 
(1) the M s score, and (2) the nucleosome occupancy (GC-corrected 
read depth) of 25 -bp sliding windows within 1 kb of the sites 
(Fig. 4D). 

Estimates of age at death were derived using linear models 
from the literature (Koch and Wagner 2011) that relate age and 
methylation levels at given CpG sites (Supplemental Material 
SI3.6). 

Expression analysis 

Three proxies for expression were defined: (1) the level of gene 
body to promoter methylation, R s ; (2) occupancy of the +1 nu- 
cleosome; and (3) strength of nucleosome phasing (see Supple- 
mental Material SI4 for details). Their respective performance was 
evaluated using expression data from 10 samples of modern hair 
(Kim et al. 2006) (GSE3058), in the absence of direct measurement 
of ancient expression levels. The modern expression data was used 
to define groups of genes with increasing expression levels (using 
10, 20, or 50 quantiles), which were ranked by each of the three 
proxies, and the Spearman correlation coefficient evaluated. Func- 
tional enrichment analyses were performed in DAVID (Huang et al. 
2009) using standard parameters. Categories showing enrichment 
scores lower than 1.2 and Benjamini-Hochberg P- values greater 
than 0.05 were disregarded. 

Data access 

All nucleosome related data sets are publicly available from our 
mirror of the UCSC Genome Browser, hgl8 assembly, "Ancient 
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Epigenomics" section (http://genome-mirror.moma.ki.au.dk/). 
The ancient horse sequence data have been submitted to the NCBI 
Sequence Read Archive (SRA; http://www.ncbi.nlm.nih.gov/sra) 
under accession number SRA105533. 
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